Torchscript benchmark measure #6907
Results for 1):
For the smaller sequence length of 128, we can see a significant speed-up (~30%); for the longer sequence length of 512, the speed-up is much smaller (and only appears for the bigger list of inputs).
Results for 2):
Here, no clear speed gains can be seen.
I'm not sure I understand all the interactions in the benchmarking framework, but I think in line 9 (non-script model) we should be returning torch.jit.trace(model, sample_input), not the untraced model. And the sample input would have to be of length max_length for it to work. That's where most of the gain comes from.
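If the model is traced on a sample input of length max_length, every shorter input has to be padded to that same length before it can be fed to the traced model. A minimal sketch of that padding step on plain Python lists of token ids; the helper name `pad_to_max_length` and the default pad id 0 are assumptions for illustration, not code from the benchmarking script:

```python
from typing import List


def pad_to_max_length(token_ids: List[int], max_length: int, pad_id: int = 0) -> List[int]:
    """Right-pad a sequence of token ids to exactly max_length.

    Hypothetical helper: a model traced on a max_length sample expects
    every input to have that same length.
    """
    if len(token_ids) > max_length:
        raise ValueError(f"sequence of length {len(token_ids)} exceeds max_length={max_length}")
    # Append pad_id until the sequence reaches max_length.
    return token_ids + [pad_id] * (max_length - len(token_ids))


# Usage: a batch of variable-length sequences becomes fixed-length.
batch = [[101, 7592, 102], [101, 102]]
padded = [pad_to_max_length(ids, max_length=5) for ids in batch]
print(padded)  # [[101, 7592, 102, 0, 0], [101, 102, 0, 0, 0]]
```

The cost of this approach is that the traced model always processes max_length positions, even for short inputs.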
Okay, yes, that makes sense! I changed the benchmarking script accordingly and now have the following results:
and
=> So my understanding is now that also cc @sgugger @LysandreJik
We saw different behavior in our experiments a few months ago. I will try to reproduce it and update here.
Was
In our experiments, using trace(model, example_input) would result in a model that only accepted sequences of the same length as example_input, whereas script(model) had no such restriction. This is the case mentioned in your documentation here: https://huggingface.co/transformers/torchscript.html#dummy-inputs-and-standard-lengths

In practice, this meant that you needed to trace with an example sequence of length max_length and then pad every example of length < max_length with zeros. Since the speed of the model is basically linear in the sequence length, for a set of inputs with varying sequence lengths, using script() instead of trace() cut the run time to roughly avg_len/max_length of the padded version.

Upon further investigation, it looks like we were using PyTorch 1.2 when we ran these experiments several months ago. In PyTorch 1.3, the fixed-length problem no longer seems to be an issue for your BERT models (we still encounter it with other model architectures we build), so there's no longer a big speed gain from script() vs. trace().

There are still some good reasons for preferring script() to trace(): scripting is guaranteed to capture the model's code-path logic, whereas tracing might miss a logic branch if the example input doesn't flow through it. Also, tracing your models currently produces several warnings. But I'm not sure whether those reasons on their own are enough motivation to make major changes to your code base.
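The avg_len/max_length argument can be made concrete. Assuming, as the comment does, that inference time is linear in sequence length, the padded (traced) run pays for max_length tokens per example while the length-flexible (scripted) run pays only for the real length. A minimal sketch of that estimate; the function name is hypothetical and the lengths in the usage line are made-up illustration values, not measurements from this thread:

```python
from typing import Sequence


def scripted_to_traced_time_ratio(lengths: Sequence[int], max_length: int) -> float:
    """Estimated run time of the unpadded (scripted) model relative to the
    padded (traced) model, assuming cost is linear in sequence length."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    if any(not 0 < n <= max_length for n in lengths):
        raise ValueError("every length must be in (0, max_length]")
    avg_len = sum(lengths) / len(lengths)
    return avg_len / max_length


# Illustrative lengths only (not benchmark data): avg_len = 80, max_length = 128.
ratio = scripted_to_traced_time_ratio([32, 64, 96, 128], max_length=128)
print(ratio)  # 0.625
```

A ratio of 0.625 would correspond to a speed-up factor of max_length/avg_len = 1.6 under the linearity assumption.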
@sgugger, what are your thoughts on this?
I think adding the scriptable layers is cleaner, to make sure everything works correctly with scripting/tracing. I mean not the approach in this PR, but the other one linked in a comment (@sbrody18, I don't know if you saw my PR to rebase this branch on master). Most of the changes end up making the code easier to read (type annotations and asserts), plus a few extra classes for the scriptable layers, but not much added code.
@sgugger I agree. I think the extra benefit of the type and None-checking is really helpful to prevent bugs and makes the code better.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This PR only exists to show some benchmarking results of `BertScriptableModel` vs. `BertModel`. It shows the results of running the script `benchmark_pytorch_scripting.py`.
In a nutshell, the script does the following:

1) Create a list of 500 and 2500 `input_tensors` of `batch_size` 1 with a sequence length varying between 1 and 128 or 1 and 512. Then take a scripted model `model = torch.jit.script(BertScriptableModel(...))` and loop over all 500 / 2500 `input_tensors` in a standard for loop. The scripted model is warmed up by running the loop 5 times before measuring the time. The loop is then run 10 times, and the fastest run is taken as the measurement.
2) Create a list of 64 and 512 `input_tensors` of `batch_size` 8 with a sequence length varying between 1 and 128 or 1 and 512. Then take a scripted model `model = torch.jit.script(BertScriptableModel(...))` and loop over all 64 / 512 `input_tensors` in a standard for loop. The scripted model is warmed up by running the loop 5 times before measuring the time. The loop is then run 10 times, and the fastest run is taken as the measurement.

All of this was run in the following environment:
=> So only on GPU.
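The measurement protocol above (5 untimed warm-up passes over all inputs, then 10 timed passes, keeping the fastest) can be sketched independently of PyTorch. In this sketch the "model" is any Python callable and each input is represented only by its sequence length; `make_lengths`, `time_fastest_pass`, and the seed are hypothetical names and are not taken from `benchmark_pytorch_scripting.py`. In the real script, the callable would be the `torch.jit.script(BertScriptableModel(...))` model and the inputs would be tensors.

```python
import random
import time
from typing import Callable, List, Sequence


def make_lengths(num_inputs: int, max_seq_len: int, seed: int = 0) -> List[int]:
    """Random sequence lengths in [1, max_seq_len], one per input."""
    rng = random.Random(seed)
    return [rng.randint(1, max_seq_len) for _ in range(num_inputs)]


def time_fastest_pass(
    model: Callable[[int], object],
    inputs: Sequence[int],
    warmup_passes: int = 5,
    timed_passes: int = 10,
) -> float:
    """Run the full input loop warmup_passes times untimed, then return the
    fastest wall-clock time (in seconds) over timed_passes loops."""
    for _ in range(warmup_passes):
        for x in inputs:
            model(x)
    best = float("inf")
    for _ in range(timed_passes):
        start = time.perf_counter()
        for x in inputs:
            model(x)
        best = min(best, time.perf_counter() - start)
    return best


# Usage with a dummy "model" whose cost grows with sequence length.
lengths = make_lengths(500, max_seq_len=128)
fastest = time_fastest_pass(lambda n: sum(range(n)), lengths)
print(f"fastest of 10 passes: {fastest * 1e3:.2f} ms")
```

Taking the minimum over repeated passes filters out noise from other processes. On GPU, CUDA kernels are launched asynchronously, so a real measurement would also need to synchronize (for example with `torch.cuda.synchronize()`) before each clock reading.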
To run this script, one can simply run:
Important:
The "for" loop corresponds to the function defined in lines 32-37 of the file `benchmark_pytorch_scripting.py`. This function then replaces the function that is usually measured in benchmarks, by setting `benchmark._prepare_inference_func = _prepare_inference_func` in line 49.

It would be awesome if @sbrody18 could take a look at the `benchmark_pytorch_scripting.py` file to check whether TorchScript was used correctly.
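Assigning a function to `benchmark._prepare_inference_func` works because Python functions are non-data descriptors: an instance attribute with the same name shadows the class method, so the benchmark ends up timing the custom loop instead of its default. A toy, self-contained sketch of that pattern; `ToyBenchmark`, `run`, and the returned strings are hypothetical stand-ins and do not reproduce the transformers benchmark classes or their signatures:

```python
from typing import Callable


class ToyBenchmark:
    """Hypothetical stand-in for a benchmark that times whatever callable
    _prepare_inference_func returns."""

    def _prepare_inference_func(self, batch_size: int) -> Callable[[], str]:
        def default_inference() -> str:
            return f"default forward pass, batch_size={batch_size}"
        return default_inference

    def run(self, batch_size: int) -> str:
        # Attribute lookup finds an instance attribute before the class
        # method of the same name, because functions are non-data descriptors.
        inference = self._prepare_inference_func(batch_size)
        return inference()


def _prepare_inference_func(batch_size: int) -> Callable[[], str]:
    # A plain function stored on the instance is NOT bound, so it gets no `self`.
    def scripted_loop() -> str:
        return f"custom for-loop over inputs, batch_size={batch_size}"
    return scripted_loop


benchmark = ToyBenchmark()
print(benchmark.run(1))  # default forward pass, batch_size=1
benchmark._prepare_inference_func = _prepare_inference_func
print(benchmark.run(8))  # custom for-loop over inputs, batch_size=8
```

The override only affects that one instance; other `ToyBenchmark` objects keep the default method.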